Assessing Human and Automated Quality Judgments in the French MT Evaluation Campaign CESTA

Authors

  • Olivier Hamon
  • Anthony Hartley
  • Andrei Popescu-Belis
  • Khalid Choukri
Abstract

This paper analyzes the results of the French MT evaluation campaign CESTA (2003-2006). The details of the campaign are first briefly described. The paper then focuses on the results of the two runs, which used human metrics such as fluency and adequacy, as well as automated metrics based mainly on n-gram comparison and word error rates. The results show that the quality of the systems can be reliably compared using these metrics, and that the adaptability of some systems to a given domain – which was the focus of CESTA’s second run – is not strictly related to their intrinsic performance.

Introduction

The French MT evaluation campaign, CESTA, completed its two phases in 2006. The goal of the campaign was to evaluate the output quality of commercial and academic systems translating into French from English and Arabic, and to assess their adaptability to a new subject domain. CESTA also studied the reliability of automatic evaluation metrics with French as a target language and produced a number of reusable language resources for MT evaluation.

This article analyzes the scores and rankings of the systems in various conditions, from a meta-evaluation point of view. One of the main questions discussed is the level of agreement between human judgments of translation quality and the scores of the automated metrics. While such metrics have been developed and studied for English as a target language, this article discusses their applicability to French. Other meta-evaluation questions include the reliability of human scores, the influence of reference translations on evaluation results, and the use of automated metrics to analyze reference translations.

The article is organized as follows. First, the overall organization of the campaign is outlined, focusing on the human and automated evaluation metrics that were used. Then, the data used for the two runs is described, with automated evaluation scores being used to estimate the variability of the reference translations. The reliability of the scores is finally discussed, first intrinsically for human scores, and then in terms of the correlation of automated scores with human scores. The results show that the automated scores are consistent with human ones, and that they reliably indicate the ranking of the systems and their capacity to adapt to a new domain.
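The automated metrics used in CESTA rely largely on n-gram comparison between a system's output and one or more reference translations. As a rough illustration only, and not the campaign's actual scoring code, the following Python sketch computes a BLEU-style clipped n-gram precision against multiple references; the function names and example sentences are invented for this illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_ngram_precision(candidate, references, n):
    """BLEU-style n-gram precision: the fraction of candidate n-grams that
    also occur in a reference, with counts clipped to the maximum number of
    occurrences in any single reference."""
    cand_counts = ngrams(candidate.split(), n)
    if not cand_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngrams(ref.split(), n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

# Toy example: one system hypothesis scored against two reference translations.
hypothesis = "le chat est sur le tapis"
references = ["le chat est sur le tapis", "un chat se trouve sur le tapis"]
print(clipped_ngram_precision(hypothesis, references, n=2))  # 1.0 for this toy case
```

Metrics of this family aggregate such precisions over several n-gram orders and over a whole test set; word-error-rate metrics instead measure edit distance to a reference.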


Similar articles

CESTA: First Conclusions of the Technolangue MT Evaluation Campaign

This article outlines the evaluation protocol and provides the main results of the French Evaluation Campaign for Machine Translation Systems, CESTA. Following the initial objectives and evaluation plans, the evaluation metrics are briefly described: along with fluency and adequacy assessed by human judges, a number of recently proposed automated metrics are used. Two evaluation campaigns were ...


Evaluation of Machine Translation with Predictive Metrics beyond BLEU/NIST: CESTA

In this paper, we report on the results of a full-size evaluation campaign of various MT systems. This campaign is novel compared to the classical DARPA/NIST MT evaluation campaigns in the sense that French is the target language, and that it includes an experiment of meta-evaluation of various metrics claiming to better predict different attributes of translation quality. We first describe the...


How Much Data is Needed for Reliable MT Evaluation? Using Bootstrapping to Study Human and Automatic Metrics

Evaluating the output quality of a machine translation system requires test data and quality metrics to be applied. Based on the results of the French MT evaluation campaign CESTA, this paper studies the statistical reliability of the scores depending on the amount of test data used to obtain them. Bootstrapping is used to compute the standard deviation of scores assigned by human judges (mainly of a...

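As a concrete illustration of the bootstrapping idea described in the abstract above, the following hypothetical Python sketch, which is not the paper's own code, resamples per-segment scores with replacement to estimate how much a system-level score (the mean over segments) would vary across test sets of the same size; the scores shown are invented.

```python
import random

def bootstrap_std(segment_scores, n_resamples=1000, seed=0):
    """Estimate the standard deviation of a system-level score (the mean of
    per-segment scores) by resampling the test segments with replacement."""
    rng = random.Random(seed)
    k = len(segment_scores)
    resampled_means = []
    for _ in range(n_resamples):
        sample = [segment_scores[rng.randrange(k)] for _ in range(k)]
        resampled_means.append(sum(sample) / k)
    mean_of_means = sum(resampled_means) / n_resamples
    variance = sum((m - mean_of_means) ** 2
                   for m in resampled_means) / (n_resamples - 1)
    return variance ** 0.5

# Invented per-segment adequacy judgments on a 1-5 scale.
scores = [4, 5, 3, 4, 2, 5, 4, 3, 4, 5]
print(round(bootstrap_std(scores), 3))
```

Running the resampling on progressively larger subsets of the test data is what lets one judge how much data is needed before the score stabilizes.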

Is my Judge a good One?

This paper aims at measuring the reliability of judges in MT evaluation. The scope is two evaluation campaigns from the CESTA project, during which human evaluations were carried out on fluency and adequacy criteria for English-to-French documents. Our objectives were threefold: observe both inter- and intra-judge agreements, and then study the influence of the evaluation design, especially implem...

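The abstract above does not state which agreement statistic was used; Cohen's kappa is one common choice for this kind of analysis. The following hypothetical Python sketch computes it for two judges rating the same segments on a categorical scale; all ratings are invented.

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa: observed agreement between two judges, corrected for
    the agreement expected by chance from each judge's rating distribution."""
    assert len(ratings_a) == len(ratings_b) and ratings_a
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Invented fluency judgments (1-5 scale) from two judges on the same segments.
judge_1 = [5, 4, 4, 3, 5, 2, 4, 3, 5, 4]
judge_2 = [5, 4, 3, 3, 5, 2, 4, 4, 5, 4]
print(round(cohens_kappa(judge_1, judge_2), 3))  # about 0.714
```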

Work-In-Progress Project Report: CESTA - Machine Translation Evaluation Campaign

CESTA, the first European Campaign dedicated to MT Evaluation, is a project labelled by the French Technolangue action. CESTA provides an evaluation of six commercial and academic MT systems using a protocol set by an international panel of experts. CESTA aims at producing reusable resources and information about reliability of the metrics. Two runs will be carried out: one using the system’s b...




Publication date: 2007